ABSTRACT. We characterize words which cluster under the Burrows-Wheeler transform as those
words w such that ww occurs in a trajectory of an interval exchange transformation, and build 
examples of clustering words. 

In 1994 Michael Burrows and David Wheeler [1] introduced a transformation on words which
proved very powerful in data compression. The aim of the present note is to characterize those
words which cluster under the Burrows-Wheeler transform, that is to say which are transformed
into such expressions as $4^a3^b2^c1^d$ or $2^a5^b3^c1^d4^e$. Clustering words on a binary alphabet have
already been extensively studied (see for instance [8, 11]) and identified as particular factors of the
Sturmian words. Some generalizations to $r$ letters appear in [11], but it had not yet been observed
that clustering words are intrinsically related to interval exchange transformations (see Definitions
1 and 2 below). This link comes essentially from the fact that the array of conjugates used to define
the Burrows-Wheeler transform gives rise to a discrete interval exchange transformation sending
its first column to its last column. It turns out that the converse is also true: interval exchange
transformations generate clustering words. Indeed we prove that clustering words are exactly those
words $w$ such that $ww$ occurs in a trajectory of an interval exchange transformation. On a binary
alphabet, this condition amounts to saying that $ww$ is a factor of an infinite Sturmian word.
We end the paper with some examples and questions on how to generate clustering words.

This paper began during a workshop on board Via Rail Canada train number 2. We are grateful to 
Laboratoire International Franco-Quebecois de Recherche en Combinatoire (LIRCO) for funding 
and Via for providing optimal working conditions. The second author is partially supported by a 
grant from the Academy of Finland. 

1. Definitions 

Let $A = \{a_1 < a_2 < \cdots < a_r\}$ be an ordered alphabet and $w = w_1 \cdots w_n$ a primitive word
on the alphabet $A$, i.e. $w$ is not a power of another word. For simplification we suppose that each
letter of $A$ occurs in $w$.

The Parikh vector of $w$ is the integer vector $(n_1, \ldots, n_r)$ where $n_i$ is the number of occurrences
of $a_i$ in $w$. The (cyclic) conjugates of $w$ are the words $w_i \cdots w_n w_1 \cdots w_{i-1}$, $1 \le i \le n$. As $w$ is
primitive, $w$ has precisely $n$ cyclic conjugates. Let $w_{i,1} w_{i,2} \cdots w_{i,n}$ denote the $i$-th conjugate of $w$
where the $n$ conjugates of $w$ are ordered in ascending lexicographical order. Then the
Burrows-Wheeler transform of $w$, denoted by $B(w)$, is the word $w_{1,n} w_{2,n} \cdots w_{n,n}$. In other words, $B(w)$ is
obtained from $w$ by first ordering its cyclic conjugates in ascending order in a rectangular array,
and then reading off the last column. We say $w$ is $\pi$-clustering if $B(w) = a_{\pi 1}^{n_{\pi 1}} \cdots a_{\pi r}^{n_{\pi r}}$, where
$\pi \neq \mathrm{Id}$ is a permutation on $\{1, \ldots, r\}$. We say $w$ is perfectly clustering if it is $\pi$-clustering for
$\pi i = r + 1 - i$, $1 \le i \le r$.
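
The definitions above translate directly into a few lines of code. The following Python sketch is only an illustration: the function names `bwt`, `clustering_permutation` and `is_perfectly_clustering` are ours, and it assumes that words are given as strings of single-character letters whose character order is the order of the alphabet. It is tried on the word 122131313, which reappears as Example 1 in Section 3.

```python
from itertools import groupby


def bwt(w):
    """Return B(w): the last column of the sorted array of cyclic conjugates of w."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


def clustering_permutation(w):
    """Return [pi(1), ..., pi(r)] if w is pi-clustering, and None otherwise.

    Letters are ranked 1..r by sorted character order (an assumption of this sketch).
    """
    letters = sorted(set(w))
    blocks = [letter for letter, _ in groupby(bwt(w))]
    if len(blocks) != len(letters):
        return None  # some letter is spread over several runs of B(w)
    pi = [letters.index(letter) + 1 for letter in blocks]
    return None if pi == sorted(pi) else pi  # the identity is excluded


def is_perfectly_clustering(w):
    """True when w is pi-clustering for pi(i) = r + 1 - i."""
    r = len(set(w))
    return clustering_permutation(w) == list(range(r, 0, -1))


print(bwt("122131313"))                     # 333221111
print(clustering_permutation("122131313"))  # [3, 2, 1]
```

Sorting all $n$ conjugates costs far more than the suffix-array methods used in practice; the naive version is kept here because it mirrors the definition line by line.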

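Definition 1. A (continuous) $r$-interval exchange transformation $T$ with probability vector
$(\alpha_1, \alpha_2, \ldots, \alpha_r)$, and permutation $\pi$ is defined on the interval $[0, 1[$, partitioned into $r$ intervals

$$\Delta_i = \Big[ \sum_{j<i} \alpha_j, \ \sum_{j \le i} \alpha_j \Big[$$

by

$$Tx = x + \tau_i \quad \text{when } x \in \Delta_i,$$

where $\tau_i = \sum_{\pi^{-1}(j) < \pi^{-1}(i)} \alpha_j - \sum_{j<i} \alpha_j$.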



Intuitively this means that the intervals $\Delta_i$ are re-ordered by $T$ following the permutation $\pi$. We
refer the reader to [13], which constitutes a classical course on general interval exchange
transformations and contains many of the technical terms found in Section 3 below. Note that our use
of the word "continuous" does not imply that $T$ is a continuous map on $[0, 1[$ (though it can be
modified to be made so); it is there to emphasize the difference with its discrete analogue.

Definition 2. A discrete $r$-interval exchange transformation $T$ with length vector $(n_1, n_2, \ldots, n_r)$,
and permutation $\pi$ is defined on a set of $n_1 + \cdots + n_r$ points $x_1, \ldots, x_{n_1 + \cdots + n_r}$ partitioned into $r$
intervals

$$\Delta_i = \Big\{ x_k, \ \sum_{j<i} n_j < k \le \sum_{j \le i} n_j \Big\}$$

by

$$T x_k = x_{k + s_i} \quad \text{when } x_k \in \Delta_i,$$

where $s_i = \sum_{\pi^{-1}(j) < \pi^{-1}(i)} n_j - \sum_{j<i} n_j$.
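
To make Definition 2 concrete, here is a minimal Python sketch that tabulates the labels and the images of the points. The function name `discrete_iet` and the list-with-padding representation are choices of this sketch, not of the paper; the permutation is passed as the tuple $(\pi 1, \ldots, \pi r)$.

```python
def discrete_iet(lengths, pi):
    """Discrete interval exchange of Definition 2.

    lengths = (n_1, ..., n_r) and pi = (pi(1), ..., pi(r)).
    Returns (labels, T) as lists indexed from 1 (index 0 is unused padding):
    labels[k] = i when x_k lies in Delta_i, and T[k] = k + s_i.
    """
    r = len(lengths)
    labels, T = [0], [0]
    for i in range(1, r + 1):
        start = sum(lengths[:i - 1])                 # sum of n_j for j < i
        before = pi.index(i)                         # pi^{-1}(i) - 1
        s_i = sum(lengths[pi[m] - 1] for m in range(before)) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            labels.append(i)
            T.append(k + s_i)
    return labels, T


labels, T = discrete_iet((4, 2, 3), (3, 2, 1))
print(labels[1:])  # [1, 1, 1, 1, 2, 2, 3, 3, 3]
print(T[1:])       # [6, 7, 8, 9, 4, 5, 1, 2, 3]
```

With length vector $(4, 2, 3)$ and $\pi 1 = 3$, $\pi 2 = 2$, $\pi 3 = 1$ the displacements are $s_1 = 5$, $s_2 = -1$, $s_3 = -6$, which is what the printed table shows.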



We recall the following notions, defined for any transformation $T$ on a set $X$ equipped with a
partition $\Delta_i$, $1 \le i \le r$.

Definition 3. The trajectory of a point $x$ under $T$ is the infinite sequence $(x_n)_{n \in \mathbb{N}}$ defined by
$x_n = i$ if $T^n x$ belongs to $\Delta_i$, $1 \le i \le r$. The mapping $T$ is minimal if whenever $E$ is a nonempty
closed subset of $X$ and $T^{-1}E = E$, then $E = X$.
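
For a discrete interval exchange both notions of Definition 3 are easy to compute. The sketch below assumes that every subset of the finite set of points is closed, so that minimality amounts to $T$ being a single cycle; the helper names `discrete_iet`, `trajectory` and `is_minimal` are hypothetical, and letters are printed as digits, which limits the sketch to $r \le 9$.

```python
def discrete_iet(lengths, pi):
    """Labels and images of the points x_1..x_n (lists indexed from 1)."""
    labels, T = [0], [0]
    for i in range(1, len(lengths) + 1):
        start = sum(lengths[:i - 1])
        s_i = sum(lengths[pi[m] - 1] for m in range(pi.index(i))) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            labels.append(i)
            T.append(k + s_i)
    return labels, T


def trajectory(labels, T, k, length):
    """First `length` letters of the trajectory of x_k."""
    letters = []
    for _ in range(length):
        letters.append(str(labels[k]))
        k = T[k]
    return "".join(letters)


def is_minimal(T):
    """A bijection of a finite set has no proper invariant subset iff it is one cycle."""
    n = len(T) - 1
    k, steps = T[1], 1
    while k != 1:
        k, steps = T[k], steps + 1
    return steps == n


labels, T = discrete_iet((4, 2, 3), (3, 2, 1))
print(trajectory(labels, T, 1, 18))  # 122131313122131313
print(is_minimal(T))                 # True
```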



2. Main result 

Theorem 1. Let $w = w_1 \cdots w_n$ be a primitive word on $A = \{1, \ldots, r\}$, such that every letter of $A$
occurs in $w$. The following are equivalent:

(1) $w$ is $\pi$-clustering,

(2) $ww$ occurs in a trajectory of a minimal discrete $r$-interval exchange transformation with
permutation $\pi$,

(3) $ww$ occurs in a trajectory of a discrete $r$-interval exchange transformation with
permutation $\pi$,

(4) $ww$ occurs in a trajectory of a continuous $r$-interval exchange transformation with
permutation $\pi$.
Proof. ((2), (3) or (4) implies (1)) By assumption there exists a point $x$ whose initial trajectory of
length $2n$ is the word $ww$. Consider the set $E = \{Tx, T^2x, \ldots, T^nx\}$. Then for each $y \in E$, the
initial trajectory of $y$ of length $n$, denoted $O(y)$, is a cyclic conjugate of $w$.

Suppose $y$ and $z$ are in $E$, and $y$ is to the left of $z$ (meaning $y < z$). Let $j$ be the smallest
nonnegative integer such that $T^jy$ and $T^jz$ are not in the same $\Delta_i$. Then $T^jy$ is to the left of $T^jz$,
either because $j = 0$ or because $T$ is increasing on each $\Delta_i$. Thus $O(y)$ is lexicographically smaller
than $O(z)$.

Thus $B(w)$ is obtained from the last letter $l(y)$ of $O(y)$ where the points $y$ are ordered from left
to right. But $l(y)$ is the label of the interval $\Delta_i$ where $T^{n-1}y$, or equivalently $T^{-1}y$, falls. Thus by
definition of $T$, if $y$ is to the left of $z$ then $\pi^{-1}(l(y)) \le \pi^{-1}(l(z))$, and if $y'$ is between $y$ and $z$ with
$l(y) = l(z)$ then $l(y') = l(y) = l(z)$, hence the claimed result. □

Proof. ((2) implies (3) implies (4)) The first implication is trivial. The second follows from
the fact that the trajectories of the discrete $r$-interval exchange transformation with length
vector $(n_1, n_2, \ldots, n_r)$, and permutation $\pi$, and of the continuous $r$-interval exchange transformation
with probability vector $\left(\frac{n_1}{n_1 + \cdots + n_r}, \ldots, \frac{n_r}{n_1 + \cdots + n_r}\right)$ and permutation $\pi$ are the same. We note that this
continuous interval exchange transformation is never minimal, while the discrete one may be. □
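
The passage from the discrete to the continuous transformation can be illustrated with exact rational arithmetic. In the sketch below (function names `continuous_iet` and `trajectory` are ours), the point $x_k$ of the discrete system with $n$ points is represented, as an assumption of the sketch, by the midpoint $(2k - 1)/(2n)$ of the cell $[(k-1)/n, k/n[$; any other point of that cell would give the same trajectory.

```python
from fractions import Fraction


def continuous_iet(alpha, pi):
    """Return the map x -> (label of x, T x) of Definition 1 on [0, 1[."""
    r = len(alpha)
    bounds = [sum(alpha[:i], Fraction(0)) for i in range(r + 1)]  # interval endpoints
    tau = {}
    for i in range(1, r + 1):
        before = pi.index(i)                                      # pi^{-1}(i) - 1
        tau[i] = sum((alpha[pi[m] - 1] for m in range(before)), Fraction(0)) - bounds[i - 1]

    def step(x):
        i = next(j for j in range(1, r + 1) if bounds[j - 1] <= x < bounds[j])
        return i, x + tau[i]

    return step


def trajectory(step, x, length):
    """First `length` letters of the trajectory of x, as a string of digits."""
    letters = []
    for _ in range(length):
        i, x = step(x)
        letters.append(str(i))
    return "".join(letters)


lengths = (4, 2, 3)
n = sum(lengths)
step = continuous_iet([Fraction(m, n) for m in lengths], (3, 2, 1))
print(trajectory(step, Fraction(1, 2 * n), 2 * n))  # 122131313122131313
```

Every point of $[0, 1[$ is periodic under this rational transformation, which is why it cannot be minimal.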

We now turn to the proof of the converse, which uses a succession of lemmas. Throughout this
proof, unless otherwise stated, a given word $w$ is a primitive word on $\{1, \ldots, r\}$, and every letter
of $\{1, \ldots, r\}$ occurs in $w$; $(n_1, \ldots, n_r)$ is its Parikh vector, the $w_{i,1} \cdots w_{i,n}$ are its conjugates.

The first lemma states that $B$ is injective on the conjugacy classes, which is proved for example
in [2] or [12]; we give here a short proof for the sake of completeness.

Lemma 2. Every antecedent of $B(w)$ by the Burrows-Wheeler transform is conjugate to $w$.

Proof. In the array of the conjugates of $w$, each column word $w_{1,j} \cdots w_{n,j}$ has the same Parikh
vector as $w$, so we retrieve this vector from $B(w)$; thus we know the first column word, which
is $1^{n_1} \cdots r^{n_r}$, and the last column word which is $B(w)$. Then the words $w_{j,n}w_{j,1}$, $1 \le j \le n$, are precisely
all words of length 2 occurring in the conjugates of $w$, and by ordering them we get the first two
columns of the array. Then the words $w_{j,n}w_{j,1}w_{j,2}$ constitute all words of length 3 occurring in the conjugates
of $w$, and we get also the subsequent column, and so on until we have retrieved the whole array,
thus $w$ up to conjugacy. □
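
The column-by-column reconstruction used in this proof can be sketched as follows. The function name `rebuild_conjugate_array` is ours, and the procedure is the naive textbook inversion, assumed here to be applied to a word that really is the transform of some word: prepend the last column to the columns already known, sort, and repeat.

```python
def bwt(w):
    """Last column of the sorted array of cyclic conjugates of w."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


def rebuild_conjugate_array(last_column):
    """Rebuild the sorted array of conjugates from its last column.

    After m passes, rows[j] holds the first m letters of the j-th conjugate.
    """
    n = len(last_column)
    rows = [""] * n
    for _ in range(n):
        # Each last letter followed by the known prefix of its row is a factor
        # of a conjugate; sorting these factors yields one more column.
        rows = sorted(last_column[j] + rows[j] for j in range(n))
    return rows


rows = rebuild_conjugate_array("333221111")
print(rows[0])  # 122131313
```

Each pass sorts $n$ strings, so this is only suitable for short words; it is meant to follow the proof, not to be efficient.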

It is easy to see that $B$, viewed as a mapping from words to words, is not surjective (see for
instance [12]). A more precise result will be proved in Corollary 4 below.

Lemma 3. If $w$ is $\pi$-clustering, the mapping $w_{j,1} \mapsto w_{j,n}$ defines a discrete $r$-interval exchange
transformation with length vector $(n_1, n_2, \ldots, n_r)$, and permutation $\pi$.

Proof. We order the occurrences of each letter in $w$ by putting $w_i < w_j$ if the conjugate
$w_i \cdots w_n w_1 \cdots w_{i-1}$ is lexicographically smaller than $w_j \cdots w_n w_1 \cdots w_{j-1}$. By primitivity, the $n$
letters of $w$ are uniquely ordered as

$$1_1 < \cdots < 1_{n_1} < 2_1 < \cdots < 2_{n_2} < \cdots < r_1 < \cdots < r_{n_r},$$

and the first column word is $1_1 \cdots 1_{n_1} 2_1 \cdots 2_{n_2} \cdots r_1 \cdots r_{n_r}$. We look at the last column word: if
$w_{j,n}$ and $w_{j+1,n}$ are both some letter $k$, the order between these two occurrences of $k$ is given by the
next letter in the conjugates of $w$, and these are respectively $w_{j,1}$ and $w_{j+1,1}$. Thus $w_{j,n} < w_{j+1,n}$.
Together with the hypothesis, this implies that the last column word is

$$(\pi 1)_1 \cdots (\pi 1)_{n_{\pi 1}} \cdots (\pi r)_1 \cdots (\pi r)_{n_{\pi r}}.$$






S. FERENCZI AND L.Q. ZAMBONI 



Thus, if we regard the rule $w_{j,1} \mapsto w_{j,n}$ as a mapping on the $n_1 + \cdots + n_r$ points

$$\{1_1, \ldots, 1_{n_1}, 2_1, \ldots, 2_{n_2}, \ldots, r_1, \ldots, r_{n_r}\}$$

and put $\Delta_i = \{i_1, \ldots, i_{n_i}\}$, we get the claimed result. □
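
The correspondence of Lemma 3 can be checked numerically on a given word. The sketch below, with hypothetical helper names `discrete_iet` and `column_maps`, ranks the occurrences of letters in $w$ by the conjugates they open. Under the indexing conventions assumed here (the $j$-th point is the occurrence opening the $j$-th row of the array), following each occurrence from the row it opens to the row it closes reproduces the map $T$ of Definition 2, while the rule $w_{j,1} \mapsto w_{j,n}$, read on occurrences, is its inverse; the two maps have the same invariant sets.

```python
def discrete_iet(lengths, pi):
    """Labels and images of the points x_1..x_n (lists indexed from 1)."""
    labels, T = [0], [0]
    for i in range(1, len(lengths) + 1):
        start = sum(lengths[:i - 1])
        s_i = sum(lengths[pi[m] - 1] for m in range(pi.index(i))) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            labels.append(i)
            T.append(k + s_i)
    return labels, T


def column_maps(w):
    """Maps between rows of the sorted conjugate array of a primitive word w."""
    n = len(w)
    order = sorted(range(n), key=lambda i: w[i:] + w[:i])  # start of each row
    rank = {start: j for j, start in enumerate(order, start=1)}
    forward = [0] * (n + 1)   # row closed by the occurrence that opens row j
    backward = [0] * (n + 1)  # rank of the occurrence that closes row j
    for j, start in enumerate(order, start=1):
        forward[j] = rank[(start + 1) % n]
        backward[j] = rank[(start - 1) % n]
    return forward, backward


forward, backward = column_maps("122131313")
_, T = discrete_iet((4, 2, 3), (3, 2, 1))
print(forward[1:])  # [6, 7, 8, 9, 4, 5, 1, 2, 3]
print(forward == T)
```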

Corollary 4. If the discrete $r$-interval exchange transformation $T$ with length vector $(n_1, n_2, \ldots, n_r)$,
and permutation $\pi$ is not minimal, the word $(\pi 1)^{n_{\pi 1}} \cdots (\pi r)^{n_{\pi r}}$ has no primitive antecedent by the
Burrows-Wheeler transform.

Proof. Let $w$ be such an antecedent. By the previous lemma, the map $w_{j,1} \mapsto w_{j,n}$ corresponds
to $T$. If $T$ is not minimal, there is a nonempty proper subset $E$ of $\{1_1, \ldots, 1_{n_1}, 2_1, \ldots, 2_{n_2}, \ldots, r_1, \ldots, r_{n_r}\}$
which is invariant by $w_{j,1} \mapsto w_{j,n}$. Thus in the conjugates of $w$, preceding any occurrence of a
letter of $E$ is another occurrence of a letter of $E$. This implies that $w$ is made up entirely of letters
of $E$, a contradiction. □

Proof. ((1) implies (2)) Let $w$ be as in the hypothesis. Then $B(w) = (\pi 1)^{n_{\pi 1}} \cdots (\pi r)^{n_{\pi r}}$. Thus, by
Corollary 4, the transformation $T$ of Lemma 3 is minimal, and thus has a periodic trajectory $w'w'w' \ldots$, where
$w'$ has Parikh vector $(n_1, \ldots, n_r)$. If $w' = u^k$ with $k \ge 2$, then $n_i = kn'_i$ for all $i$, where $(n'_1, \ldots, n'_r)$ is the
Parikh vector of $u$, and the set made with the $n'_i$ leftmost points of each $\Delta_i$ is $T$-invariant, contradicting
minimality; thus $w'$ must be primitive.

By the proof, made above, that (2) implies (1), $w'$ is $\pi$-clustering. Hence $B(w') = B(w)$ and,
by Lemma 2, $w$ is conjugate to $w'$, hence $ww$ occurs also in a trajectory of $T$. □
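
The construction in this proof can be followed on a concrete word. The sketch below assumes that $w$ is $\pi$-clustering over the letters 1, ..., $r$ written as digits (so $r \le 9$); it reads the Parikh vector and $\pi$ off $B(w)$, builds the discrete interval exchange of Definition 2, and returns the word $w'$ read along the orbit of $x_1$. The function name `periodic_word_of` is ours.

```python
from itertools import groupby


def bwt(w):
    """Last column of the sorted array of cyclic conjugates of w."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


def discrete_iet(lengths, pi):
    """Labels and images of the points x_1..x_n (lists indexed from 1)."""
    labels, T = [0], [0]
    for i in range(1, len(lengths) + 1):
        start = sum(lengths[:i - 1])
        s_i = sum(lengths[pi[m] - 1] for m in range(pi.index(i))) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            labels.append(i)
            T.append(k + s_i)
    return labels, T


def periodic_word_of(w):
    """Word w' of length |w| read along the orbit of x_1 (w assumed pi-clustering)."""
    r = len(set(w))
    lengths = [w.count(str(i)) for i in range(1, r + 1)]   # Parikh vector
    pi = [int(letter) for letter, _ in groupby(bwt(w))]    # blocks of B(w), in order
    labels, T = discrete_iet(lengths, pi)
    k, letters = 1, []
    for _ in range(len(w)):
        letters.append(str(labels[k]))
        k = T[k]
    return "".join(letters)


w = "4123231312412"
w_prime = periodic_word_of(w)
print(w_prime)                                            # 1232313124124
print(w_prime in {w[i:] + w[:i] for i in range(len(w))})  # True
```

Since $x_1$ is the occurrence opening the smallest conjugate, the word obtained this way is the lexicographically smallest conjugate of $w$.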

Some of the hypotheses of Theorem 1 may be weakened.

Alphabet. $\{1, \ldots, r\}$ can be replaced by any ordered set $A = \{a_1 < a_2 < \cdots < a_r\}$ by using
a letter-to-letter morphism. Thus for a given word $w$, we can restrict the alphabet to the letters
occurring in $w$. Note that if $ww$ occurs in a trajectory of an $r$-interval exchange transformation,
but only the letters $j_1, \ldots, j_d$ occur in $w$, then, by the reasoning of the proof that (4) implies (1), $w$
is $\pi'$-clustering, where $\pi'$ is the unique permutation on $\{1, \ldots, d\}$ such that $(\pi')^{-1}(y) < (\pi')^{-1}(z)$
iff $\pi^{-1}(j_y) < \pi^{-1}(j_z)$. If $\pi$ is a permutation defining perfect clustering, then so is $\pi'$.

Primitivity. The Burrows-Wheeler transformation can be extended to a non-primitive word
$w_1 \cdots w_n$, by ordering its $n$ (not necessarily different) conjugates $w_i \cdots w_n w_1 \cdots w_{i-1}$ by
non-strictly increasing lexicographical order and taking the word made by their last letters.

In this case the result of Corollary 4 does not extend: for example $B(1322313223) = 3333222211$
though the discrete 3-interval exchange transformation with length vector $(2, 4, 4)$, and
permutation $\pi 1 = 3$, $\pi 2 = 2$, $\pi 3 = 1$ is not minimal. Note that if $(\pi 1)^{n_{\pi 1}} \cdots (\pi r)^{n_{\pi r}}$ has a non-primitive
antecedent by the Burrows-Wheeler transform, then the $n_i$ have a common factor $k$. There exist
(see below) non-minimal discrete interval exchange transformations which do not satisfy that
condition, and thus words such as 32221 which have no antecedent at all by the Burrows-Wheeler
transformation.

But our Theorem 1 is still valid for non-primitive words: the proof in the first direction does not
use the primitivity, while in the reverse direction we write $w = u^k$, apply our proof to the primitive
$u$, and check that $u^{2k}$ occurs also in a trajectory.
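
The two numerical claims of this paragraph can be checked by brute force. In the sketch below, `bwt` sorts the $n$ conjugates with repetitions, which is the extension described above, and `antecedents` (a name of ours) simply tries every arrangement of the letters of the target word; this is only feasible for very short words.

```python
from itertools import permutations


def bwt(w):
    """Extended transform: the n conjugates are sorted with repetitions allowed."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


def antecedents(target):
    """All words v with B(v) == target, found by exhaustive search."""
    candidates = {"".join(p) for p in permutations(sorted(target))}
    return sorted(v for v in candidates if bwt(v) == target)


print(bwt("1322313223"))     # 3333222211
print(antecedents("32221"))  # []
print(antecedents("33221"))  # the five conjugates of 13223
```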

Two permutations. An extension of Theorem 1 which fails is to consider, as the dynamicians do
[13], interval exchange transformations defined by permutations $\pi$ and $\pi'$; this amounts to coding
the interval $\Delta_i$ by $\pi'i$ instead of $i$. A simple counter-example will be clearer than a long definition:
take points $x_1, \ldots, x_9$ labelled 223331111 and send them to 111133322 by a (minimal) discrete
3-interval exchange transformation, but where the points are not labelled as in Definition 3 (namely
$Tx_1 = x_8$, $Tx_3 = x_5$ etc.). Then $w = 123131312$ is such that $ww$ occurs in trajectories of $T$ but
$B(w) = 323311112$.

3. Building clustering words 

Theorem 1 provides two different ways to build clustering words, from infinite trajectories
either of discrete (or rational) interval exchange transformations or of continuous aperiodic interval
exchange transformations. For $r = 2$ and the permutation $\pi 1 = 2$, $\pi 2 = 1$, the first ones give
all the periodic balanced words, and the second ones give (by Proposition 5 below) all infinite
Sturmian words: both these ways of building clustering words on two letters are used, explicitly or
implicitly, in [8].

The use of discrete interval exchange transformations leads naturally to the question of
characterizing all minimal discrete $r$-interval exchange transformations through their length vector; this
has been solved by [10] for $r = 3$ and $\pi 1 = 3$, $\pi 2 = 2$, $\pi 3 = 1$: if the length vector is $(n_1, n_2, n_3)$,
minimality is equivalent to $(n_1 + n_2)$ and $(n_2 + n_3)$ being coprime. Thus

Example 1. The discrete interval exchange $111122333 \to 333221111$ gives rise to the perfectly
clustering word 122131313.

The same reasoning extends to other permutations: for $\pi 1 = 2$, $\pi 2 = 3$, $\pi 3 = 1$, minimality is
equivalent to $n_1$ and $(n_2 + n_3)$ being coprime; for $\pi 1 = 3$, $\pi 2 = 1$, $\pi 3 = 2$, minimality is equivalent
to $n_3$ and $(n_2 + n_1)$ being coprime; for other permutations on these three letters, $T$ is never minimal.
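
These coprimality criteria lend themselves to a quick exhaustive check on small length vectors. The sketch below is not a proof; it compares, for all $n_1, n_2, n_3 \le 7$ and all permutations of three letters, the stated criteria with a direct test that the discrete interval exchange of Definition 2 is a single cycle. The names `is_minimal` and `predicted` are ours.

```python
from math import gcd
from itertools import permutations, product


def is_minimal(lengths, pi):
    """True when the discrete interval exchange of Definition 2 is a single cycle."""
    n = sum(lengths)
    T = [0] * (n + 1)
    for i in range(1, len(lengths) + 1):
        start = sum(lengths[:i - 1])
        s_i = sum(lengths[pi[m] - 1] for m in range(pi.index(i))) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            T[k] = k + s_i
    k, steps = T[1], 1
    while k != 1:
        k, steps = T[k], steps + 1
    return steps == n


def predicted(lengths, pi):
    """Minimality as predicted by the coprimality criteria for r = 3."""
    n1, n2, n3 = lengths
    if pi == (3, 2, 1):
        return gcd(n1 + n2, n2 + n3) == 1
    if pi == (2, 3, 1):
        return gcd(n1, n2 + n3) == 1
    if pi == (3, 1, 2):
        return gcd(n3, n1 + n2) == 1
    return False  # remaining permutations: never minimal


mismatches = [
    (lengths, pi)
    for lengths in product(range(1, 8), repeat=3)
    for pi in permutations((1, 2, 3))
    if is_minimal(lengths, pi) != predicted(lengths, pi)
]
print(mismatches)  # an empty list means the criteria agree on this range
```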

For $r \ge 4$ intervals, the question is still open. An immediate equivalent condition for
non-minimality is $\sum_{i=1}^{m} s_{w_i} = 0$ for $m < n_1 + \cdots + n_r$ and $w_1 \cdots w_m$ a word occurring in a trajectory. It
is easy to build non-minimal examples satisfying such an equality for simple words $w$: for example,
for $r = 4$ and $\pi 1 = 4$, $\pi 2 = 3$, $\pi 3 = 2$, $\pi 4 = 1$, taking $n_1 = n_2 = n_3 = 1$ gives non-minimal examples
for any value of $n_4$, the equality being satisfied for $w = 24^q$ if $n_4 = 3q$, $w = 14^{q+1}$ if $n_4 = 3q + 1$,
$w = 34^q$ if $n_4 = 3q + 2$. Similarly, the following example shows how we still do get clustering
words, but they may be somewhat trivial.

Example 2. The discrete interval exchange $111233444 \to 444332111$ satisfies the above equality
for $w = 14$; it is non-minimal and gives two perfectly clustering words on smaller alphabets, 41
and 323.
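
The orbit structure behind Example 2 and the family $n_1 = n_2 = n_3 = 1$ can be listed explicitly. The sketch below (helper names `discrete_iet` and `cycle_words` are ours) decomposes a discrete interval exchange into cycles and reads the word of labels along each cycle, starting from its smallest point.

```python
def discrete_iet(lengths, pi):
    """Labels and images of the points x_1..x_n (lists indexed from 1)."""
    labels, T = [0], [0]
    for i in range(1, len(lengths) + 1):
        start = sum(lengths[:i - 1])
        s_i = sum(lengths[pi[m] - 1] for m in range(pi.index(i))) - start
        for k in range(start + 1, start + lengths[i - 1] + 1):
            labels.append(i)
            T.append(k + s_i)
    return labels, T


def cycle_words(lengths, pi):
    """One word per cycle of T, read from the smallest point of the cycle."""
    labels, T = discrete_iet(lengths, pi)
    seen, words = set(), []
    for k in range(1, len(T)):
        letters = []
        while k not in seen:
            seen.add(k)
            letters.append(str(labels[k]))
            k = T[k]
        if letters:
            words.append("".join(letters))
    return words


print(cycle_words((3, 1, 2, 3), (4, 3, 2, 1)))  # ['14', '14', '14', '233']
```

The words 14 and 233 are conjugate to the words 41 and 323 of Example 2, and the displacements along the first cycle are $s_1 = 6$ and $s_4 = -6$, which sum to zero as required.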

To study continuous aperiodic interval exchange transformations we need a technical condition
called i.d.o.c. [9], which states that the orbits of the discontinuities of $T$ are infinite and disjoint.
It is proved in [9] or in [13] that this condition implies aperiodicity and minimality, and that, if $\pi$
is primitive, i.e. $\pi\{1, \ldots, d\} \neq \{1, \ldots, d\}$ for $d < r$, then the $r$-interval exchange transformation
with probability vector $(\alpha_1, \ldots, \alpha_r)$ and permutation $\pi$ satisfies the i.d.o.c. condition if $\alpha_1, \ldots, \alpha_{r-1}$
and 1 are rationally independent. We can now prove

Proposition 5. Let $w = w_1 \cdots w_n$ be a primitive word on $A = \{1, \ldots, r\}$, such that every letter
of $A$ occurs in $w$. Then $w$ is $\pi$-clustering if and only if $ww$ occurs in a trajectory of a continuous
$r$-interval exchange transformation with permutation $\pi$, satisfying the i.d.o.c. condition.

Proof. The "if" direction is as in Theorem 1. To get the "only if", we generate $w$ by a minimal
discrete interval exchange transformation as in (2) of Theorem 1, and thus its permutation $\pi$ is primitive. Then
we replace it by a continuous periodic interval exchange transformation as in the proof that (3)
implies (4). But, because cylinders are always semi-open intervals, if a given word $ww$ occurs in
a trajectory of a continuous $r$-interval exchange transformation with permutation $\pi$ and probability
vector $(\alpha_1, \ldots, \alpha_r)$, it occurs also in trajectories of every $r$-interval exchange transformation
with the same permutation whose probability vector is close enough to $(\alpha_1, \ldots, \alpha_r)$. Thus we can
change the $\alpha_i$ to get the irrationality condition which implies the i.d.o.c. condition. □
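
The perturbation step of this proof can be illustrated numerically. The sketch below uses floating-point numbers, which cannot represent a genuinely irrational probability vector, so it only illustrates the stability of the word $ww$ under a small change of the $\alpha_i$ and says nothing about the i.d.o.c. condition itself. The function name `continuous_trajectory`, the size `eps` of the perturbation and the choice of the starting point $1/18$ are assumptions of the sketch.

```python
from bisect import bisect_right
from math import sqrt


def continuous_trajectory(alpha, pi, x, length):
    """First `length` letters of the trajectory of x under the map of Definition 1."""
    r = len(alpha)
    bounds = [sum(alpha[:i]) for i in range(r + 1)]          # interval endpoints
    tau = []
    for i in range(1, r + 1):
        before = pi.index(i)                                 # pi^{-1}(i) - 1
        tau.append(sum(alpha[pi[m] - 1] for m in range(before)) - bounds[i - 1])
    letters = []
    for _ in range(length):
        i = min(bisect_right(bounds, x), r)                  # index of the interval containing x
        letters.append(str(i))
        x += tau[i - 1]
    return "".join(letters)


rational = [4 / 9, 2 / 9, 3 / 9]
eps = 1e-6
raw = [a + eps * sqrt(p) for a, p in zip(rational, (2, 3, 5))]
perturbed = [a / sum(raw) for a in raw]                      # renormalized to sum 1

x = 1 / 18  # midpoint of the first cell of the discrete system with 9 points
print(continuous_trajectory(rational, (3, 2, 1), x, 18))
print(continuous_trajectory(perturbed, (3, 2, 1), x, 18))
```

Both lines show the same word $ww$ with $w = 122131313$: the orbit of $1/18$ stays at distance about $1/18$ from the endpoints of the intervals, far more than the drift caused by a perturbation of size $10^{-6}$ over 18 steps.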

Trajectories of interval exchange transformations satisfying the i.d.o.c. condition may be
explicitly constructed via the self-dual induction algorithms of [5] for $r = 3$ and $\pi 1 = 3$, $\pi 2 = 2$, $\pi 3 = 1$,
[6] for all $r$ and $\pi i = r + 1 - i$, and the forthcoming [4] in the most general case. More precisely,
Proposition 4.1 of O shows that if the permutation is $\pi i = r + 1 - i$ (or more generally if the
permutation is in the Rauzy class of $\pi i = r + 1 - i$), then there exist infinitely many words $ww$ in
the trajectories. It also gives a sufficient condition for building such words: if a bispecial word $w$,
a suffix $s$ and a prefix $p$ of $w$ are such that $pw = ws$, then both $pp$ and $ss$ occur in the trajectories.
In turn, a recipe to achieve that relation is given in (i) of Theorem 2.8 of [6]: we just need that in
the underlying algorithm described in Section 2.6 of [6], either $p_n(i) = i$ or $m_n(i) = i$ (except for
some initial values of $n$, where, for $i = 1$, $p$ and $s$ are longer than $w$). Many explicit examples of
$ww$ have been built in this way.

• For $r = 3$ in [5], $w = A_k$, $w = B_k$ in Proposition 2.10,

Example 3. 13131312222 and 122131222131221313 are perfectly clustering. 

• For $r = 4$ in 0, $w = M_2(k)$, $w = P_3(k)M_1(k)$ in Lemma 4.1 and in Lemma 5.1,

Example 4. The words $2^m(3141)^n32$ are perfectly clustering for any $m > 3$ and $n > 2$.

• For all r = n in 10, w = P kjl>1 , w = P ktn _ iti+1 P kti+ljn _i, w = Mk^+^i^M^^+^i 
in Theorem 12; 

Example 5. 5252434252516152516161525161 is perfectly clustering. 

For other permutations, we shall describe in [4] an algorithm generalizing the one in flSJ . We 
also construct an example of an interval exchange transformation which does not produce infinitely 
many $ww$. For the permutation $\pi 1 = 4$, $\pi 2 = 3$, $\pi 3 = 1$, $\pi 4 = 2$, examples can be found in
Theorem 5.2 of D, with $w = P_{1,q_n}M_{2,q_n}$, $w = P_{2,q_n}M_{3,q_n}$, $w = P_{3,q_n}M_{1,q_n}$,

Example 6. 4123231312412 is $\pi$-clustering.
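
The examples of this section are easy to test by machine. The sketch below computes $B(w)$ naively and returns the sequence of distinct letters of its runs, so that a perfectly clustering word on $r$ letters gives the string $r \cdots 21$ and Example 6 gives 4312. The function name `block_letters` is ours, and for Example 4 only a few values of $m$ and $n$ are tried.

```python
from itertools import groupby


def bwt(w):
    """Last column of the sorted array of cyclic conjugates of w."""
    conjugates = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(c[-1] for c in conjugates)


def block_letters(w):
    """Letters of the successive runs of B(w), one letter per run."""
    return "".join(letter for letter, _ in groupby(bwt(w)))


print(block_letters("13131312222"))                   # Example 3
print(block_letters("122131222131221313"))            # Example 3
print(block_letters("5252434252516152516161525161"))  # Example 5
print(block_letters("4123231312412"))                 # Example 6
for m in range(4, 7):
    for n in range(3, 6):
        print(m, n, block_letters("2" * m + "3141" * n + "32"))  # Example 4
```

A word is $\pi$-clustering exactly when the returned string has one letter per letter of the alphabet and is not increasing, in which case it spells $\pi 1 \cdots \pi r$.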

We remark that our self-dual induction algorithms for aperiodic interval exchange transforma- 
tions generate families of nested clustering words with increasing length, and thus may be more 
efficient in producing very long clustering words than the more immediate algorithm using discrete 
interval exchange transformations. 